
    Cali-Sketch: Stroke Calibration and Completion for High-Quality Face Image Generation from Poorly-Drawn Sketches

    The image generation task has received increasing attention because of its wide applications in security and entertainment. Sketch-based face generation offers more engaging interaction and better image quality thanks to the supervision the input sketch provides. However, when a sketch poorly aligned with the true face is given as input, existing supervised image-to-image translation methods often cannot generate acceptable photo-realistic face images. To address this problem, in this paper we propose Cali-Sketch, a poorly-drawn-sketch to photo-realistic-image generation method. Cali-Sketch explicitly models stroke calibration and image generation using two constituent networks: a Stroke Calibration Network (SCN), which calibrates strokes of facial features and enriches facial details while preserving the original intent features, and an Image Synthesis Network (ISN), which translates the calibrated and enriched sketches into photo-realistic face images. In this way, we decouple a difficult cross-domain translation problem into two easier steps. Extensive experiments verify that the face photos generated by Cali-Sketch are both photo-realistic and faithful to the input sketches compared with state-of-the-art methods.
    Comment: 10 pages, 12 figures
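    The abstract describes a two-stage decoupling: the SCN first refines the rough strokes, and the ISN then translates the calibrated sketch into a photo. Below is a minimal PyTorch sketch of that two-stage wiring; the layer choices, widths, and class names are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the two-stage idea: a stroke-calibration network (SCN)
# refines a rough sketch, then an image-synthesis network (ISN) translates
# the refined sketch into an RGB face image. All sizes are assumptions.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.InstanceNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class EncoderDecoder(nn.Module):
    """Small encoder-decoder reused for both stages of the sketch."""
    def __init__(self, in_ch, out_ch, width=64):
        super().__init__()
        self.net = nn.Sequential(
            conv_block(in_ch, width),
            conv_block(width, width * 2),
            conv_block(width * 2, width),
            nn.Conv2d(width, out_ch, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return self.net(x)

class CaliSketchLike(nn.Module):
    def __init__(self):
        super().__init__()
        self.scn = EncoderDecoder(in_ch=1, out_ch=1)   # sketch -> calibrated sketch
        self.isn = EncoderDecoder(in_ch=1, out_ch=3)   # calibrated sketch -> RGB photo

    def forward(self, rough_sketch):
        calibrated = torch.tanh(self.scn(rough_sketch))
        photo = torch.tanh(self.isn(calibrated))
        return calibrated, photo

if __name__ == "__main__":
    model = CaliSketchLike()
    sketch = torch.randn(1, 1, 256, 256)               # a poorly drawn input sketch
    calibrated, photo = model(sketch)
    print(calibrated.shape, photo.shape)               # (1, 1, 256, 256) (1, 3, 256, 256)
```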

    Specialist or Generalist? Instruction Tuning for Specific NLP Tasks

    The potential of large language models (LLMs) to simultaneously perform a wide range of natural language processing (NLP) tasks has been the subject of extensive research. Although instruction tuning has proven to be a data-efficient method for transforming LLMs into such generalist models, their performance still lags behind specialist models trained exclusively for specific tasks. In this paper, we investigate whether incorporating broad-coverage generalist instruction tuning can contribute to building a specialist model. We hypothesize that its efficacy depends on task specificity and skill requirements. Our experiments assess four target tasks with distinct coverage levels, revealing that integrating generalist instruction tuning consistently enhances model performance when the task coverage is broad. The effect is particularly pronounced when the amount of task-specific training data is limited. Further investigation into three target tasks focusing on different capabilities demonstrates that generalist instruction tuning improves understanding and reasoning abilities. However, for tasks requiring factual knowledge, generalist data containing hallucinatory information may negatively affect the model's performance. Overall, our work provides a systematic guide for developing specialist models with general instruction tuning. Our code and other related resources can be found at https://github.com/DavidFanzz/Generalist_or_Specialist.
    Comment: Accepted to EMNLP 202
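    The abstract studies mixing broad-coverage generalist instruction data into the fine-tuning set of a task-specific model, especially when task-specific data is scarce. The snippet below is a hedged sketch of that data-mixing step only; the record format, mixing ratio, and function name are assumptions for illustration, not the paper's exact recipe.

```python
# Sketch of mixing generalist instruction data with a smaller pool of
# task-specific examples before fine-tuning. Ratio and format are assumptions.
import random

def mix_instruction_data(generalist, specialist, generalist_ratio=0.5, seed=0):
    """Return a shuffled training mix of generalist and specialist examples.

    generalist, specialist: lists of {"instruction": ..., "output": ...} dicts.
    generalist_ratio: number of sampled generalist examples relative to the
    specialist pool size (the specialist data is always kept in full).
    """
    rng = random.Random(seed)
    n_general = int(len(specialist) * generalist_ratio)
    sampled_general = rng.sample(generalist, min(n_general, len(generalist)))
    mixed = list(specialist) + sampled_general
    rng.shuffle(mixed)
    return mixed

if __name__ == "__main__":
    generalist = [{"instruction": f"general task {i}", "output": "..."} for i in range(1000)]
    specialist = [{"instruction": f"target task {i}", "output": "..."} for i in range(200)]
    mixed = mix_instruction_data(generalist, specialist, generalist_ratio=0.5)
    print(len(mixed))  # 300 examples: all specialist data plus sampled generalist data
```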

    TediGAN: Text-Guided Diverse Face Image Generation and Manipulation

    In this work, we propose TediGAN, a novel framework for multi-modal image generation and manipulation with textual descriptions. The proposed method consists of three components: a StyleGAN inversion module, visual-linguistic similarity learning, and instance-level optimization. The inversion module maps real images to the latent space of a well-trained StyleGAN. The visual-linguistic similarity module learns text-image matching by mapping images and text into a common embedding space. The instance-level optimization is for identity preservation in manipulation. Our model can produce diverse and high-quality images at an unprecedented resolution of 1024 × 1024. Using a control mechanism based on style mixing, our TediGAN inherently supports image synthesis with multi-modal inputs, such as sketches or semantic labels, with or without instance guidance. To facilitate text-guided multi-modal synthesis, we propose Multi-Modal CelebA-HQ, a large-scale dataset consisting of real face images and corresponding semantic segmentation maps, sketches, and textual descriptions. Extensive experiments on the introduced dataset demonstrate the superior performance of our proposed method. Code and data are available at https://github.com/weihaox/TediGAN.
    Comment: CVPR 2021. Code: https://github.com/weihaox/TediGAN Data: https://github.com/weihaox/Multi-Modal-CelebA-HQ Video: https://youtu.be/L8Na2f5viA
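    Of the three components in the abstract, the visual-linguistic similarity learning is the easiest to illustrate in isolation: image and text features are projected into a common embedding space and text-image matching is scored there. The sketch below shows that idea with stand-in linear encoders and a cosine-similarity loss; it is an assumption-laden illustration, not TediGAN's StyleGAN inversion or instance-level optimization code.

```python
# Sketch of visual-linguistic similarity learning in a shared embedding space.
# The projection dimensions and loss form are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEmbedding(nn.Module):
    def __init__(self, img_dim=512, txt_dim=256, embed_dim=128):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)   # image feature -> shared space
        self.txt_proj = nn.Linear(txt_dim, embed_dim)   # text feature  -> shared space

    def forward(self, img_feat, txt_feat):
        img_emb = F.normalize(self.img_proj(img_feat), dim=-1)
        txt_emb = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return img_emb, txt_emb

def matching_loss(img_emb, txt_emb):
    """Pull matched image-text pairs together (cosine similarity toward 1)."""
    return (1.0 - F.cosine_similarity(img_emb, txt_emb, dim=-1)).mean()

if __name__ == "__main__":
    model = SharedEmbedding()
    img_feat = torch.randn(4, 512)     # e.g. features derived from inverted latent codes
    txt_feat = torch.randn(4, 256)     # e.g. pooled features of the text descriptions
    img_emb, txt_emb = model(img_feat, txt_feat)
    print(matching_loss(img_emb, txt_emb).item())
```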

    Domain Fingerprints for No-reference Image Quality Assessment

    Human fingerprints are detailed and nearly unique markers of human identity. Analogously, a unique and stable fingerprint is left on every acquired image: it reveals how the image was degraded during the acquisition procedure and is therefore closely related to image quality. In this work, we propose a new no-reference image quality assessment (NR-IQA) approach called domain-aware IQA (DA-IQA), which for the first time introduces the concept of a domain fingerprint to the NR-IQA field. The domain fingerprint of an image is learned from image collections with different degradations and is then used as a unique characteristic to identify the degradation sources and assess the quality of the image. To this end, we design a new domain-aware architecture, which enables simultaneous determination of both the distortion sources and the quality of an image. With the distortion in an image better characterized, the image quality can be more accurately assessed, as verified by extensive experiments showing that the proposed DA-IQA outperforms almost all compared state-of-the-art NR-IQA methods.
    Comment: Accepted by IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)
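    The "simultaneous determination of both the distortion sources and the quality of an image" suggests a shared backbone with two heads: a classifier over degradation types and a regressor for the quality score. The minimal sketch below shows that multi-task layout; the backbone, feature size, and number of distortion types are assumptions, not the DA-IQA architecture from the paper.

```python
# Sketch of a domain-aware multi-task layout: one head classifies the
# distortion source, the other regresses the quality score. Sizes are assumed.
import torch
import torch.nn as nn

class DomainAwareIQA(nn.Module):
    def __init__(self, num_distortions=5):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.distortion_head = nn.Linear(64, num_distortions)  # which degradation?
        self.quality_head = nn.Linear(64, 1)                   # how good is the image?

    def forward(self, x):
        feat = self.backbone(x)
        return self.distortion_head(feat), self.quality_head(feat).squeeze(-1)

if __name__ == "__main__":
    model = DomainAwareIQA()
    images = torch.randn(2, 3, 224, 224)
    distortion_logits, quality = model(images)
    print(distortion_logits.shape, quality.shape)  # (2, 5) (2,)
```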

    Match4Rec: A Novel Recommendation Algorithm Based on Bidirectional Encoder Representation with the Matching Task

    Characterizing users' interests accurately plays a significant role in an effective recommender system. A sequential recommender system can learn powerful hidden representations of users from successive user-item interactions and dynamic user preferences. To analyze such sequential data, conventional methods mainly rely on Markov Chains (MCs) and Recurrent Neural Networks (RNNs). Recently, the use of self-attention mechanisms and bidirectional architectures has gained much attention. However, a major limitation remains in previous works: they model the user's main purposes in the behavioral sequences only separately and locally, and they lack a global representation of the user's whole sequential behavior. To address this limitation, we propose a novel bidirectional sequential recommendation algorithm that integrates the user's local purposes with the global preference through additive supervision from a matching task. We combine the mask task with the matching task when training the bidirectional encoder. A new sample production method is also introduced to alleviate the effect of mask noise. Our proposed model not only learns bidirectional semantics from users' behavioral sequences but also explicitly produces user representations that capture the user's global preference. Extensive empirical studies demonstrate that our approach considerably outperforms various state-of-the-art models.
    Comment: Accepted by ICONIP202
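    The core idea, a bidirectional encoder trained with both a mask task and a sequence-level matching task, can be sketched with a small transformer encoder: per-position logits recover masked items, while a pooled sequence vector serves as the global user representation scored against candidate items. The dimensions, mean pooling, and dot-product matching below are illustrative assumptions, not the Match4Rec implementation.

```python
# Sketch of a bidirectional encoder with a mask head (per-position item
# prediction) and a pooled user representation for the matching task.
import torch
import torch.nn as nn

class BiDirRecEncoder(nn.Module):
    def __init__(self, num_items=1000, dim=64, heads=4, layers=2):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, dim)
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.mask_head = nn.Linear(dim, num_items)   # mask task: recover masked items

    def forward(self, item_ids):
        h = self.encoder(self.item_emb(item_ids))    # bidirectional hidden states
        mask_logits = self.mask_head(h)              # per-position item distribution
        user_repr = h.mean(dim=1)                    # global user representation
        return mask_logits, user_repr

def matching_scores(user_repr, candidate_emb):
    """Matching task: dot-product score between user and candidate items."""
    return (user_repr * candidate_emb).sum(dim=-1)

if __name__ == "__main__":
    model = BiDirRecEncoder()
    seq = torch.randint(0, 1000, (8, 20))            # 8 users, 20 interactions each
    mask_logits, user_repr = model(seq)
    candidates = model.item_emb(torch.randint(0, 1000, (8,)))
    print(mask_logits.shape, matching_scores(user_repr, candidates).shape)
```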

    Multimodal Prototype-Enhanced Network for Few-Shot Action Recognition

    Current methods for few-shot action recognition mainly fall into the metric learning framework following ProtoNet. However, they either ignore the effect of representative prototypes or fail to adequately enhance the prototypes with multimodal information. In this work, we propose a novel Multimodal Prototype-Enhanced Network (MORN) that uses the semantic information of label texts as multimodal information to enhance prototypes, with two modality flows. In the visual flow, a CLIP visual encoder is introduced and visual prototypes are computed by the Temporal-Relational CrossTransformer (TRX) module. In the text flow, a frozen CLIP text encoder is introduced and a semantic-enhanced module is used to enhance text features; text prototypes are then obtained by inflating the enhanced text features. The final multimodal prototypes are computed by a multimodal prototype-enhanced module. In addition, no existing metric evaluates the quality of prototypes. To the best of our knowledge, we are the first to propose a prototype evaluation metric, Prototype Similarity Difference (PRIDE), which measures how well prototypes discriminate between different categories. We conduct extensive experiments on four popular datasets. MORN achieves state-of-the-art results on HMDB51, UCF101, Kinetics and SSv2. MORN also performs well on PRIDE, and we explore the correlation between PRIDE and accuracy.
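    The abstract describes PRIDE only as a measure of how well prototypes discriminate between categories. One plausible reading is a gap between a sample's similarity to its own class prototype and its similarity to the other prototypes; the sketch below implements that reading with cosine similarities, and the exact formula in the paper may differ.

```python
# Hedged sketch of a "prototype similarity difference" style metric: own-class
# similarity minus mean other-class similarity, averaged over queries. This is
# an assumed formulation, not necessarily the paper's PRIDE definition.
import torch
import torch.nn.functional as F

def prototype_similarity_difference(queries, prototypes, labels):
    """queries: (N, D) features, prototypes: (C, D), labels: (N,) class indices."""
    q = F.normalize(queries, dim=-1)
    p = F.normalize(prototypes, dim=-1)
    sims = q @ p.t()                                       # (N, C) cosine similarities
    same = sims[torch.arange(len(labels)), labels]         # similarity to own prototype
    mask = torch.ones_like(sims, dtype=torch.bool)
    mask[torch.arange(len(labels)), labels] = False
    other = sims[mask].view(len(labels), -1).mean(dim=1)   # mean similarity to others
    return (same - other).mean()                           # larger = more discriminative

if __name__ == "__main__":
    feats = torch.randn(10, 128)                           # query features
    protos = torch.randn(5, 128)                           # one prototype per class
    labels = torch.randint(0, 5, (10,))
    print(prototype_similarity_difference(feats, protos, labels).item())
```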